Prediction with Bigrams

In this notebook I will show how to do prediction on text using a very simple scheme. The idea is to get the frequency of the letters in a given text and, for every letter, predict the most frequent letter overall. This is, in my opinion, the simplest method that does not use temporal information.

To obtain a prediction that is almost as simple yet does include temporal information, we calculate the most frequent bigrams (pairs of adjacent letters, like 'r' followed by 's' at the end of the word 'letters'). For every letter we then find the letter that is most likely to follow it and use that as the prediction.
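
To make the idea concrete before touching the corpus, here is a tiny self-contained sketch of both schemes on a made-up string (the string and variable names are only for this illustration):


In [ ]:
from collections import Counter

toy = 'letters and more letters'

# Baseline: always predict the single most frequent letter.
baseline_letter = Counter(toy).most_common(1)[0][0]

# Bigram scheme: for each letter, find the letter that most often follows it.
follow = {}
for prev in set(toy):
    nxt = Counter(cur for p, cur in zip(toy, toy[1:]) if p == prev)
    if nxt:
        follow[prev] = nxt.most_common(1)[0][0]

print('Most frequent letter:', repr(baseline_letter))
print('Most likely letter after each letter:', follow)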

Setup


In [1]:
import nltk
from nltk.book import text7 as text


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

First we extract the letter sequence from the text.


In [2]:
letters = ' '.join(text)  # Join the tokens into a single string separated by spaces
letters = [letter.lower() for letter in letters]  # Lowercase every character
symbols = set(letters)
Nletters = len(letters)
Nsymbols = len(symbols)

In [3]:
symbols


Out[3]:
{' ',
 '!',
 '#',
 '$',
 '%',
 '&',
 "'",
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '@',
 '\\',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}

In [4]:
print('Number of letters', Nletters)
print('Number of symbols', Nsymbols)


Number of letters 544269
Number of symbols 54

We get the frequency of all the letters and the most common one, which turns out to be the space. Later we will analyze how the result changes when we remove the space from the whole analysis.


In [5]:
freq_letters = nltk.FreqDist(letters)  # Frequency distribution over the letters
most_common_letter = freq_letters.most_common(1)[0][0]
print('most common letter', most_common_letter)


most common letter  

In [6]:
freq_letters.plot()

We get the bigram frequencies as well.


In [7]:
bigrams = nltk.bigrams(letters)
freq_bigrams = nltk.FreqDist(bigrams)

Now we want to extract the most probable next letter for every letter. From the bigram frequencies we first build a dictionary (the master dictionary) that maps each symbol to all the letters that follow it and their frequencies; the key is the symbol and the value is a list of (next letter, frequency) tuples. With this in hand, we take from every list the entry with the maximum frequency (using a lambda as the key function) and build the next-letter dictionary from it.


In [8]:
master_dictionary = {}
next_letters = {}

for symbol in symbols:
    master_dictionary[symbol] = [(key[1], value) for key,value in freq_bigrams.items() if key[0]==symbol]

for symbol in symbols:
    aux = max(master_dictionary[symbol], key=lambda x:x[1])  # Maximize over the second element of the tuple
    next_letters[symbol] = aux[0]
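
The same next-letter table can be built more directly with NLTK's ConditionalFreqDist, which conditions the bigram counts on the first letter. A minimal alternative sketch; it should produce the same table as next_letters above, up to ties:


In [ ]:
# Alternative: condition the bigram counts on the first letter and take
# the most frequent sample for each condition.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(letters))
next_letters_cfd = {symbol: cfd[symbol].max() for symbol in symbols}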

Predictions


In [9]:
prediction = 0
for letter in letters:
    if letter == most_common_letter:
        prediction += 1  # Gets the prediction right

prediction /= Nletters
print('Predictions using the most common letter', prediction * 100.0)


Predictions using the most common letter 18.497287187034352
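
As a quick sanity check, this baseline is just the relative frequency of the most common letter, which FreqDist already exposes. A minimal sketch reusing the objects defined above:


In [ ]:
# The loop above simply counts occurrences of the most common letter,
# so FreqDist.freq should reproduce the same proportion.
print('Relative frequency of the most common letter',
      freq_letters.freq(most_common_letter) * 100.0)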

In [10]:
# Now we make use of the temporal information
prediction_temp = 0
last_letter = None
for index, letter in enumerate(letters):
    if last_letter:  # If last_letter is not None
        if next_letters[last_letter] == letter:
            prediction_temp += 1
    # Save the last letter
    last_letter = letter

prediction_temp /= Nletters
print('Prediction using bigram information', prediction_temp * 100)


Prediction using bigram information 27.537486059283182
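
As mentioned earlier, a natural follow-up is to see how both numbers change when the space is removed from the analysis. A minimal sketch of that variant (not run here, so no numbers are claimed):


In [ ]:
# Repeat both predictions on the letter sequence with spaces removed.
letters_ns = [letter for letter in letters if letter != ' ']
freq_ns = nltk.FreqDist(letters_ns)
print('Baseline without spaces', freq_ns.freq(freq_ns.max()) * 100.0)

cfd_ns = nltk.ConditionalFreqDist(nltk.bigrams(letters_ns))
hits = sum(1 for prev, cur in nltk.bigrams(letters_ns) if cfd_ns[prev].max() == cur)
print('Bigram prediction without spaces', hits / len(letters_ns) * 100.0)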
